class: center, middle, inverse, title-slide # Potential outcomes framework, RCT and power analysis ## Tutorial 3 ### Stanislav Avdeev --- # The fundamental problem of causal inference - The main goal we have in doing causal inference is in making *as good a guess as possible* as to what that `Y` *would have been* if `X` had been different - That "would have been" is called a *counterfactual* - counter to the fact of what actually happened - In doing so, we want to think about two people/firms/countries that are basically *exactly the same* except that one has `X=0` and one has `X=1` --- # Potential outcomes - The logic we just went through is the basis of the *potential outcomes model*, which is one way of thinking about causality - We can't observe the counterfactual, and must make an estimate of what the *outcome* would *potentially* have been under the counterfactual - Figuring out that makes a good counterfactual estimate is a key part of causal inference --- # Randomized experiments - A common way to do causal inference in many fields is an *experiment* - If you can *randomly assign* `X`, then you know that the people with `X=0` are, on average, exactly the same as the people with `X=1` - When we're working with people/firms/countries, running experiments is often infeasible, impossible, or unethical - So we have to think hard about a *model* of what the world looks like - So that we can use our model to figure out what the *counterfactual* would be --- # DGP - In causal inference, the *model* is our idea of what we think the process is that *generated the data* - We have to make some assumptions about what this is - We put together what we know about the world with assumptions and end up with our model - The model can then tell us what kinds of things could give us wrong results so we can fix them and get the right counterfactual --- # DGP: simulation - Let's simulate a dataset where we know the DGP - Let's say that getting `X` causes `Y` to increase by 1 - And let's run a randomized experiment of who actually gets X ```r df <- data.frame(Y.without.X = rnorm(1000),X=sample(c(0,1),1000,replace=T)) %>% mutate(Y.with.X = Y.without.X + 1) %>% #Now assign who actually gets X mutate(Observed.Y = ifelse(X==1,Y.with.X,Y.without.X)) #And see what effect our experiment suggests X has on Y df %>% group_by(X) %>% summarize(Y = mean(Observed.Y)) ``` ``` ## # A tibble: 2 × 2 ## X Y ## <dbl> <dbl> ## 1 0 -0.0224 ## 2 1 1.04 ``` --- # DGP: simulation - Now this time we can't randomize X ```r df <- data.frame(Z = runif(10000)) %>% mutate(Y.without.X = rnorm(10000) + Z, Y.with.X = Y.without.X + 1) %>% #Now assign who actually gets X mutate(X = Z > .7,Observed.Y = ifelse(X==1,Y.with.X,Y.without.X)) df %>% group_by(X) %>% summarize(Y = mean(Observed.Y)) ``` ``` ## # A tibble: 2 × 2 ## X Y ## <lgl> <dbl> ## 1 FALSE 0.339 ## 2 TRUE 1.83 ``` --- # DGP: simulation - But if we properly model the process and compare apples to apples ```r df %>% filter(abs(Z-.7)<.01) %>% group_by(X) %>% summarize(Y = mean(Observed.Y)) ``` ``` ## # A tibble: 2 × 2 ## X Y ## <lgl> <dbl> ## 1 FALSE 0.625 ## 2 TRUE 1.62 ``` --- # Identification - We have "identified" a causal effect if *the estimate that we generate gives us a causal effect* - In other words, *when we see the estimate, we can claim that it's isolating just the causal effect* - Simply looking at `lm(Y~X)` gives us the causal effect in the randomized-X case. `lm(Y~X)` **identifies** the effect of `\(X\)` on `\(Y\)` - But `lm(Y~X)` does *not* give us the causal effect in the non-randomized case we did. In that case, `lm(Y~X)` **does not identify** the causal effect, but the apples-to-apples comparison we did *does* identify the effect - Causal inference is all about figuring out *what calculation we need to do to identify that effect* --- # Identification - Identifying effects requires us to understand the **data generating process** - And once we understand that DGP, knowing what calculations we need to do to isolate our effect - Often these will require taking some conditional values (controls) - Or **isolating the variation we want** in some other way --- # Treatment effects - For any given treatment, there are likely to be *many* treatment effects - Different individuals will respond to different degrees (or even directions) - This is called *heterogeneous treatment effects* - When we identify a treatment effect, what we're *estimating* is some mixture of all those individual treatment effects - But what kind of mixture? - What we get depends on *the research design itself* as well as *the estimator we use to perform that design* --- # Individual treatment effects - While we can't always estimate it directly, the true regression model becomes something like $$ Y = \alpha + X' \beta_i + U $$ - `\(\beta_i\)` follows its own distribution across individuals (notice subscript `\(i\)`) - Remember, this is theoretical - we'd still have those individual `\(\beta_i\)`s even with one observation per individual and no way to estimate them separately --- # Distribution of the treatment effects - There are methods that try to give us the whole distribution of effects - But often we only get a single effect, `\(\hat{\beta}\)`. - This `\(\hat{\beta}\)` is some summary statistic of the `\(\beta_i\)` distribution. But *what* summary statistic? - Average treatment effect: the mean of `\(\beta_i\)` - Conditional average treatment effect (CATE): the mean of `\(\beta_i\)` *conditional on some value* (say, "just for men", i.e. conditional on being a man) --- # Conditional average treatment effects - The ATE among some demographic group - The ATE among some specific group (conditional average treatment effect) - The ATE just among people who were actually treated (ATET) - The ATE just among people who were NOT actually treated (ATUT) --- # Treatment effects - Which average you'd *want* depends on what you'd want to do with it - Want to know how effective a treatment *was* when it was applied? Average treatment effect on the treated - Want to know how effective a treatment would be if applied to everyone/at random? Average treatment effect - Want to know how effective a treatment would be if applied *just a little more broadly?* Local average treatment effect (next lecture) --- # Treatment effects - Different treatment effect averages aren't *wrong* but we need to pay attention to which one we're getting, or else we may apply the result incorrectly - A result could end up representing a different group than you're really interested in - There are technical ways of figuring out what average you get, and also intuitive ways --- # Heterogeneous treatment effects: simulation - Let's simulate some data and see what different methods give us - We'll start with some basic data where the effect is already identified - The effect varies according to a normal distribution, which has mean 5 for group A and mean 7 for group B (mean = 6 overall) - Random assignment / an experimental setting ```r set.seed(1000) tb <- tibble(group = sample(c('A','B'), 5000, replace = TRUE), W = rnorm(5000, mean = 0, sd = sqrt(8))) %>% mutate(beta1 = case_when( group == 'A' ~ rnorm(5000, mean = 5, sd = 2), group == 'B' ~ rnorm(5000, mean = 7, sd = 2))) %>% mutate(X = rnorm(5000)) %>% mutate(Y = beta1*X + rnorm(5000)) ``` --- # Heterogeneous treatment effects: simulation - We're already identified, so let's just regress `\(Y\)` on `\(X\)` <table style="NAborder-bottom: 0; width: auto !important; margin-left: auto; margin-right: auto;" class="table"> <thead> <tr> <th style="text-align:left;"> </th> <th style="text-align:center;"> Model 1 </th> <th style="text-align:center;"> Model 2 </th> <th style="text-align:center;"> Model 3 </th> </tr> </thead> <tbody> <tr> <td style="text-align:left;"> (Intercept) </td> <td style="text-align:center;"> 0.025 </td> <td style="text-align:center;"> 0.026 </td> <td style="text-align:center;"> 0.051 </td> </tr> <tr> <td style="text-align:left;"> </td> <td style="text-align:center;"> (0.034) </td> <td style="text-align:center;"> (0.044) </td> <td style="text-align:center;"> (0.045) </td> </tr> <tr> <td style="text-align:left;"> X </td> <td style="text-align:center;"> 5.975*** </td> <td style="text-align:center;"> 4.993*** </td> <td style="text-align:center;"> 6.873*** </td> </tr> <tr> <td style="text-align:left;"> </td> <td style="text-align:center;"> (0.035) </td> <td style="text-align:center;"> (0.045) </td> <td style="text-align:center;"> (0.045) </td> </tr> </tbody> <tfoot><tr><td style="padding: 0; " colspan="100%"> <sup></sup> + p < 0.1, * p < 0.05, ** p < 0.01, *** p < 0.001</td></tr></tfoot> </table> - We get 5.975, pretty close to the true average treatment effect of 6 - Note the standard error is nothing like the standard deviation of the treatment effect - those are measuring two very different things --- # Performing an experiment 1. Figure out what needs to be randomized (the treatment, usually) 1. Figure out what you want your outcome to be 1. **Figure out where you'll do the experiment and how much data you need** 1. Figure out how to randomize the treatment 1. Perform the experiment and collect data 1. Check balance 1. Analyze the data 1. Check for threats to the experiment --- # Power analysis - *Statistical power* - having a large enough sample that we can have a reasonable expectation of finding a result if it's there - In experiments, we have some control over our sample size - So *before* collecting any data, we need to do a *power analysis* - what sample size do we need to get our standard errors down to a useful level? - Power analysis also applies to observational data/non-experimental data too, it's not specific to experiments - We just don't do it as often because we can't control the sample size anyway, and it's easier to get huge samples - You still huge samples to reasonably study small effects - So don't pursue effects that are likely to be really tiny, or at least tinier than your sample can handle - If we ran the underpowered study anyway and *do* get a significant result, it would be more likely to be a false positive than a true positive. That's low power --- # Power analysis - Using X as shorthand for the treatment and Y as shorthand for the outcome, assuming we’re doing a power analysis for the a study of the relationship between X and Y, power analysis balances five things: 1. size of the effect (coefficient in a regression, a correlation, etc.) 1. amount of variation in the treatment (the variance of X, say) 1. amount of other variation in Y (the R2, or the variation from the residual after explaining Y with X, etc.) 1. power (the standard error of the estimate, statistical power, i.e. the true-positive rate) 1. sample size --- # Power analysis: implementation - In order to do power analysis, you need to be able to fill in the values for four of those five pieces, so that power analysis can tell you the fifth one 1. Use previous research results about these values 1. Use standard practice for statistical power. In the past, a goal of 80% statistical power has been standard. These days 90% used a lot more often - You can use a standard formula to calculate your outcomes `$$\text{MDE} = (t_{1-\alpha/2} - t_{1-q})\sqrt{\frac{1}{p(1-p)}} \sqrt{\frac{\sigma^2}{n}}$$` - Functions for doing power analysis calculations in R: like `power.t.test()` or the **powerMediation** package - Simulations --- # Power analysis: simulation - The idea of a power analysis is: 1. to have data with certain properties (variance of X, size of the effect, etc.) 1. to use certain analytic methods (regression, etc.) 1. to make some claims about the sampling variation of the estimator (statistical power, size of standard errors, etc.) - That's what simulations allow you to do --- # Power analysis: simulation Save your results to use them afterwards: ```r set.seed(777) coef_results <- c() sig_results <- c() for (i in 1:2000) { # Have to re-create the data EVERY TIME or it will just be the same data over and over tib <- tibble( X = runif(1000, 0, 1) ) %>% mutate(Y = .2*X + rnorm(1000, mean = 0, sd = 3)) # Run the analysis model <- feols(Y ~ X, data = tib, se = 'hetero') # Get the results coef_results[i] <- coef(model)[2] sig_results[i] <- tidy(model)$p.value[2] <= .05 } ``` --- # Power analysis: simulation - Our estimate of statistical power is the proportion of the results that are significant: ```r mean(sig_results) ``` ``` ## [1] 0.098 ``` - So we have statistical power of 9.80% - We might also want to look at the distribution of the coefficient itself - The standard deviation of the coefficient across all the simulated runs gives you a good idea of what the standard error of the coefficient will be (`sd(coef_results)`, which gives us `\(\hat{\sigma}_{\beta} =\)` 0.33). --- # Power analysis: simulation Check the distribution of the effects <!-- --> --- # Power analysis: simulation Check the distribution of the significance levels <!-- --> --- # Power analysis: simulation - The main goal of power analysis is to calculate the *minimum detectable effect* or *smallest sample size* for a given power level - How can we do that here? By trying different values of effect size and sample size and seeing what we get - To do this, we're first going to take everything we've done so far and put it inside a *function* that we can call --- # Power analysis: simulation ```r my_power_function <- function(effect, sample_size) { sig_results <- c() for (i in 1:500) { # Have to re-create the data EVERY TIME or it will just be the same data over and over tib <- tibble( X = runif(sample_size, 0, 1) ) %>% mutate(Y = effect*X + rnorm(sample_size, mean = 0, sd = 3)) # Run the analysis model <- feols(Y ~ X, data = tib, se = 'hetero') # Get the results sig_results[i] <- tidy(model)$p.value[2] <= .05 } sig_results %>% mean() %>% return() } ``` --- # Power analysis: simulation Now we can just call the function, setting `effect` and `sample_size` to whatever we want, and get the power back! Let's check it with the values we had before and make sure we're in the same range: ```r my_power_function(.2, 1000) ``` ``` ## [1] 0.08 ``` Seems good --- # Power analysis: simulation - Now let's say we really are stuck with a sample size of 1000 and we want to know the minimum detectable effect size we can get a power of .8 with. - To do this, we can just run our function with `sample_size = 1000` and a bunch of different `effect` values until we get back a power above .8 ```r power_levels <- c() effects_to_try <- c(.4, .8, 1.2, 1.6, 2) for (i in 1:5) { power_levels[i] <- my_power_function(effects_to_try[i], 1000) } # Where do we cross 80%? power_results <- tibble(effect = effects_to_try, power = power_levels) power_results ``` ``` ## # A tibble: 5 × 2 ## effect power ## <dbl> <dbl> ## 1 0.4 0.212 ## 2 0.8 0.72 ## 3 1.2 0.962 ## 4 1.6 0.996 ## 5 2 1 ``` ```r ggplot(power_results, aes(x = effect, y = power)) + geom_line(color = 'red', size = 1.5) + # add a horizontal line at 90% geom_hline(aes(yintercept = .8), linetype = 'dashed') + # Prettify! theme_minimal() + scale_y_continuous(labels = scales::percent) + labs(x = 'Linear Effect Size', y = 'Power') ``` <!-- --> --- # Power analysis: simulation <!-- --> - So it looks like we need an effect somewhere between .8 and 1.2 to have an 80% chance of finding a significant result - If we don't think the effect is actually likely to be that large, then we need to figure out something else to do - find a bigger sample, use a more precise estimation method, something! Or else we should probably walk away.